This report explores a dataset containing quality and attributes for approximately 1600 bottles of red wine.
The numbers of rows and columns in the data are
## [1] 1599
## [1] 13
## 'data.frame': 1599 obs. of 14 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ quality.factor : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality quality.factor
## Min. : 8.40 Min. :3.000 3: 10
## 1st Qu.: 9.50 1st Qu.:5.000 4: 53
## Median :10.20 Median :6.000 5:681
## Mean :10.42 Mean :5.636 6:638
## 3rd Qu.:11.10 3rd Qu.:6.000 7:199
## Max. :14.90 Max. :8.000 8: 18
Our dataset consists of 13 variables, with almost 1,600 observations.
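The overview above can be reproduced with base R. The following is a minimal sketch, assuming a small data frame typed in by hand from the first five rows shown in the `str()` output (only a few columns are included). The name `wine_head` is hypothetical, and the actual analysis would read the full CSV instead.

```r
# First five rows of selected columns, copied from the str() output above.
# `wine_head` is a hypothetical name used only for this illustration.
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56, 0.00),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4),
  quality          = c(5L, 5L, 5L, 6L, 5L)
)

# Keep a factor version of quality with one level per score from 3 to 8,
# so that scores absent from this small sample still appear as levels.
wine_head$quality.factor <- factor(wine_head$quality, levels = 3:8)

dim(wine_head)      # number of rows and columns
str(wine_head)      # column types and leading values
summary(wine_head)  # per-column summary statistics
```

With `levels = 3:8`, the internal codes of `quality.factor` for these rows are 3 3 3 4 3, which agrees with the `str()` output above.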
Most red wines have a quality score of 5 or 6. Why do most wines fall into these two scores? What makes a wine better? Are the wines rated very poorly or very highly outliers? I wonder what this distribution looks like across the other variables.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
As a result,
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.1400 0.2800 0.2954 0.4400 1.0000
I omitted the 0 values from citric.acid and plotted again. There seem to be three peaks, at 0.02, 0.24 and 0.49.
For skewed, long-tailed features, I applied a log10 transformation to better understand their distributions. The summaries are shown below.
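Dropping the zero values can be sketched as follows. This is only an illustration, assuming the first ten citric.acid values from the `str()` output earlier in the report rather than the full column; the variable names are hypothetical.

```r
# First ten citric.acid values, copied from the str() output earlier in the report.
citric <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# Keep only the strictly positive values before summarising or plotting.
citric_nonzero <- citric[citric > 0]

summary(citric_nonzero)
length(citric) - length(citric_nonzero)  # number of zero values dropped
```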
## wineQualityReds$residual.sugar
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
## log10(wineQualityReds$residual.sugar)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.04576 0.27875 0.34242 0.36925 0.41497 1.19033
## wineQualityReds$chlorides
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
## log10(wineQualityReds$chlorides)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.921 -1.155 -1.102 -1.088 -1.046 -0.214
## wineQualityReds$total.sulfur.dioxide
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
## log10(wineQualityReds$total.sulfur.dioxide)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.7782 1.3424 1.5798 1.5638 1.7924 2.4609
## wineQualityReds$alcohol
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
## log10(wineQualityReds$alcohol)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9243 0.9777 1.0086 1.0158 1.0453 1.1732
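The effect of the transformation can be checked by hand on the quartile values printed above. The sketch below assumes only those five residual.sugar values (not the full column), and the variable names are hypothetical.

```r
# Quartile values of residual.sugar taken from the summary printed above.
residual_sugar_q <- c(min = 0.9, q1 = 1.9, median = 2.2, q3 = 2.6, max = 15.5)

# log10 compresses the long right tail: the maximum is about 17 times the
# minimum on the raw scale, but the two differ by only about 1.24 log10 units.
log_q <- log10(residual_sugar_q)
round(log_q, 5)

# Back-transforming recovers the original values.
10^log_q
```

The rounded results agree with the log10 summary above. The mean does not transform this way: log10(2.539) is about 0.405, whereas the mean of the log values reported above is 0.369.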
After log transformation,
Total acidity is slightly right-skewed, with most wines at the lower end of the range, around 7.5 g/dm^3.
I also created 3 categories divided by quality score:
This helps give a rough picture of how the “quality” feature is distributed.
Plotting quality.cut gives the chart below.
## low middle high
## 63 1319 217
With this categorization, most of the red wines (1319 out of 1599) fall into the “middle” group.
There are 1599 red wines in the dataset with 13 attributes (X, fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, quality). Quality is an ordered factor with integer levels from 3 to 8.
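The grouping can be sketched with `cut()`. This is a hedged illustration: the quality vector is rebuilt from the per-score counts in the summary table, and the cut points (3-4 low, 5-6 middle, 7-8 high) are inferred from the group sizes rather than taken from the original code.

```r
# Rebuild the quality vector from the per-score counts in the summary above.
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Bin scores into three ordered groups: 3-4 = low, 5-6 = middle, 7-8 = high.
# cut() uses right-closed intervals by default: (2,4], (4,6], (6,8].
# These break points are an assumption inferred from the group counts.
quality.cut <- cut(quality,
                   breaks = c(2, 4, 6, 8),
                   labels = c("low", "middle", "high"),
                   ordered_result = TRUE)

table(quality.cut)
```

With these break points the table reproduces the 63 / 1319 / 217 split shown above.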
Other observations:
The main feature of interest is the quality of red wines and its relation to the other attributes. I would like to identify the main factors that determine wine quality, and I hope to build a predictive model of wine quality that combines several variables in the dataset.
The other features (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, and alcohol) are likely to affect wine quality. After reading some articles online, I suspect that volatile acidity and sulphates lower wine quality.
I created a new variable, ‘total acidity’, as the sum of fixed acidity, volatile acidity and citric acid, because total acidity is one of the important factors in evaluating wine quality, in balance with sweetness and bitterness. Unlike pH, which measures the strength of the acidity, total acidity is the total amount of all acids present.
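The new variable is a plain element-wise sum. A minimal sketch, assuming the first five rows of the three acidity columns copied from the `str()` output (the full analysis would apply the same sum to every row):

```r
# First five rows of the three acidity columns, from the str() output.
fixed.acidity    <- c(7.4, 7.8, 7.8, 11.2, 7.4)
volatile.acidity <- c(0.70, 0.88, 0.76, 0.28, 0.70)
citric.acid      <- c(0.00, 0.00, 0.04, 0.56, 0.00)

# Total acidity as the element-wise sum of the three components.
total.acidity <- fixed.acidity + volatile.acidity + citric.acid
total.acidity
```

Because fixed.acidity is far larger than the other two components, total.acidity tracks it very closely, which is consistent with their correlation of about 0.995 in the correlation table.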
I also created a “quality.cut” categorization, which divides wine quality into three groups: “low”, “middle” and “high”. There are 63, 1319 and 217 wines in these categories, respectively. I hope this categorization helps show the rough tendencies of good and poor wines.
I log-transformed the right-skewed distributions: residual.sugar, chlorides, total.sulfur.dioxide and alcohol.
Observations after log transformation:
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.256130895 0.67170343
## volatile.acidity -0.25613089 1.000000000 -0.55249568
## citric.acid 0.67170343 -0.552495685 1.00000000
## residual.sugar 0.11477672 0.001917882 0.14357716
## chlorides 0.09370519 0.061297772 0.20382291
## free.sulfur.dioxide -0.15379419 -0.010503827 -0.06097813
## total.sulfur.dioxide -0.11318144 0.076470005 0.03553302
## density 0.66804729 0.022026232 0.36494718
## pH -0.68297819 0.234937294 -0.54190414
## sulphates 0.18300566 -0.260986685 0.31277004
## alcohol -0.06166827 -0.202288027 0.10990325
## quality 0.12405165 -0.390557780 0.22637251
## total.acidity 0.99482800 -0.156620601 0.62825187
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.114776724 0.093705186 -0.153794193
## volatile.acidity 0.001917882 0.061297772 -0.010503827
## citric.acid 0.143577162 0.203822914 -0.060978129
## residual.sugar 1.000000000 0.055609535 0.187048995
## chlorides 0.055609535 1.000000000 0.005562147
## free.sulfur.dioxide 0.187048995 0.005562147 1.000000000
## total.sulfur.dioxide 0.203027882 0.047400468 0.667666450
## density 0.355283371 0.200632327 -0.021945831
## pH -0.085652422 -0.265026131 0.070377499
## sulphates 0.005527121 0.371260481 0.051657572
## alcohol 0.042075437 -0.221140545 -0.069408354
## quality 0.013731637 -0.128906560 -0.050656057
## total.acidity 0.117473729 0.102183639 -0.158241719
## total.sulfur.dioxide density pH
## fixed.acidity -0.11318144 0.66804729 -0.68297819
## volatile.acidity 0.07647000 0.02202623 0.23493729
## citric.acid 0.03553302 0.36494718 -0.54190414
## residual.sugar 0.20302788 0.35528337 -0.08565242
## chlorides 0.04740047 0.20063233 -0.26502613
## free.sulfur.dioxide 0.66766645 -0.02194583 0.07037750
## total.sulfur.dioxide 1.00000000 0.07126948 -0.06649456
## density 0.07126948 1.00000000 -0.34169933
## pH -0.06649456 -0.34169933 1.00000000
## sulphates 0.04294684 0.14850641 -0.19664760
## alcohol -0.20565394 -0.49617977 0.20563251
## quality -0.18510029 -0.17491923 -0.05773139
## total.acidity -0.10760684 0.68488647 -0.67314051
## sulphates alcohol quality total.acidity
## fixed.acidity 0.183005664 -0.06166827 0.12405165 0.99482800
## volatile.acidity -0.260986685 -0.20228803 -0.39055778 -0.15662060
## citric.acid 0.312770044 0.10990325 0.22637251 0.62825187
## residual.sugar 0.005527121 0.04207544 0.01373164 0.11747373
## chlorides 0.371260481 -0.22114054 -0.12890656 0.10218364
## free.sulfur.dioxide 0.051657572 -0.06940835 -0.05065606 -0.15824172
## total.sulfur.dioxide 0.042946836 -0.20565394 -0.18510029 -0.10760684
## density 0.148506412 -0.49617977 -0.17491923 0.68488647
## pH -0.196647602 0.20563251 -0.05773139 -0.67314051
## sulphates 1.000000000 0.09359475 0.25139708 0.15956033
## alcohol 0.093594750 1.00000000 0.47616632 -0.08426530
## quality 0.251397079 0.47616632 1.00000000 0.08570932
## total.acidity 0.159560329 -0.08426530 0.08570932 1.00000000
From the table above, I expect to see correlations between these features:
The relations between these features are plotted, and their R^2 values calculated, below.
## R^2 value of residual.sugar and density
## [1] 0.1262263
## R^2 value of free.sulfur.dioxide and total.sulfur.dioxide
## [1] 0.4457785
## R^2 value of total.acidity and pH
## [1] 0.4531181
## R^2 value of citric.acid and pH
## [1] 0.2936601
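For a linear regression with a single predictor, R^2 equals the squared Pearson correlation, so the values above can be cross-checked against the correlation table. The sketch below only copies four coefficients from that table; the vector names are hypothetical.

```r
# Pearson correlations copied from the correlation table above.
r <- c(
  "residual.sugar ~ density"                   = 0.355283371,
  "free.sulfur.dioxide ~ total.sulfur.dioxide" = 0.667666450,
  "total.acidity ~ pH"                         = -0.673140510,
  "citric.acid ~ pH"                           = -0.541904140
)

# Squaring removes the sign, so R^2 alone does not show the direction.
r_squared <- r^2
round(r_squared, 7)
```

The squared values agree with the four R^2 values listed above. The sign of the relation (negative for both pH pairs) has to be read from the correlation table.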
Observations:
Also from the correlation table above, “fixed.acidity”, “volatile.acidity”, “citric.acid”, “chlorides”, “total.sulfur.dioxide”, “density”, “sulphates”, “alcohol”, and “total.acidity” have higher correlation with quality, so I am going to draw a Matrix chart with these features.
Quality
fixed.acidity
volatile.acidity
citric.acid
chlorides
total.sulfur.dioxide
density
I want to look more closely at these correlations using scatter plots and box plots.
Quality
Findings from box plots:
I fitted linear models for the variables that might be correlated with quality.
Quality and volatile.acidity
##
## Call:
## lm(formula = quality ~ volatile.acidity, data = subset(wineQualityReds,
## volatile.acidity <= quantile(wineQualityReds$volatile.acidity,
## 0.999)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.78977 -0.54547 -0.01325 0.47198 2.92568
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.55757 0.05841 112.27 <2e-16 ***
## volatile.acidity -1.74500 0.10503 -16.61 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7436 on 1596 degrees of freedom
## Multiple R-squared: 0.1474, Adjusted R-squared: 0.1469
## F-statistic: 276 on 1 and 1596 DF, p-value: < 2.2e-16
Quality and citric.acid
##
## Call:
## lm(formula = quality ~ citric.acid, data = subset(wineQualityReds,
## citric.acid <= quantile(wineQualityReds$citric.acid, 0.999)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.01809 -0.59820 0.09909 0.50922 2.59711
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.37360 0.03371 159.384 <2e-16 ***
## citric.acid 0.97651 0.10144 9.627 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7847 on 1595 degrees of freedom
## Multiple R-squared: 0.05491, Adjusted R-squared: 0.05432
## F-statistic: 92.68 on 1 and 1595 DF, p-value: < 2.2e-16
Quality and alcohol
##
## Call:
## lm(formula = quality ~ alcohol, data = subset(wineQualityReds,
## alcohol <= quantile(wineQualityReds$alcohol, 0.999)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8489 -0.4065 -0.1787 0.5176 2.5909
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.81782 0.17512 10.38 <2e-16 ***
## alcohol 0.36646 0.01672 21.92 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7083 on 1596 degrees of freedom
## Multiple R-squared: 0.2314, Adjusted R-squared: 0.2309
## F-statistic: 480.4 on 1 and 1596 DF, p-value: < 2.2e-16
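The three calls above share one pattern: drop the rows above the 99.9th percentile of the predictor, then fit a one-variable linear model. The sketch below illustrates that pattern on synthetic data, not on the wine dataset; the data frame `toy` and its coefficients are made up for the example.

```r
set.seed(42)  # reproducible synthetic data, NOT the wine dataset

# Hypothetical stand-in for the wine data: 500 rows with a linear signal plus noise.
toy <- data.frame(alcohol = runif(500, min = 8.4, max = 14.9))
toy$quality <- 2 + 0.35 * toy$alcohol + rnorm(500, sd = 0.7)

# Drop rows above the 99.9th percentile of the predictor, as in the calls above.
trimmed <- subset(toy, alcohol <= quantile(toy$alcohol, 0.999))

fit <- lm(quality ~ alcohol, data = trimmed)
summary(fit)$r.squared   # share of the variance in quality explained by alcohol
coef(fit)                # intercept and slope
```

Trimming at the 99.9th percentile removes only the most extreme values of the predictor, so it mainly guards the fit against a few high-leverage points.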
From the result of R^2 scores,
## wine quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0050 0.0350 0.1710 0.3275 0.6600
## wine quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0300 0.0900 0.1742 0.2700 1.0000
## wine quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2300 0.2437 0.3600 0.7900
## wine quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2600 0.2738 0.4300 0.7800
## wine quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.3050 0.4000 0.3752 0.4900 0.7600
## wine quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0300 0.3025 0.4200 0.3911 0.5300 0.7200
Although I suspected that citric.acid is one of the causes of wine faults, contrary to my intuition the minimum citric.acid value is largest at quality 8.
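Per-score summaries like the ones above can be produced with `tapply()`. A minimal sketch, assuming only the first ten rows of quality and citric.acid from the `str()` output, so the numbers differ from the full-data summaries:

```r
# First ten rows of quality and citric.acid, from the str() output.
quality     <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
citric.acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# One summary() per quality score.
tapply(citric.acid, quality, summary)

# Minimum citric.acid within each quality score.
min_by_quality <- tapply(citric.acid, quality, min)
min_by_quality
```

In this ten-row sample the score 6 group contains a single wine, so its minimum is just that one value; group sizes matter when comparing minima.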
While 2 < quality < 6:
##
## Call:
## lm(formula = quality ~ total.sulfur.dioxide, data = subset(wineQualityReds35,
## total.sulfur.dioxide <= quantile(wineQualityReds35$total.sulfur.dioxide,
## 0.999)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.89308 0.03613 0.10220 0.14153 0.17457
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.8159953 0.0221026 217.892 < 2e-16 ***
## total.sulfur.dioxide 0.0015732 0.0003368 4.671 3.56e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3354 on 741 degrees of freedom
## Multiple R-squared: 0.0286, Adjusted R-squared: 0.02729
## F-statistic: 21.82 on 1 and 741 DF, p-value: 3.564e-06
While 5 < quality < 9:
##
## Call:
## lm(formula = quality ~ total.sulfur.dioxide, data = subset(wineQualityReds68,
## total.sulfur.dioxide <= quantile(wineQualityReds68$total.sulfur.dioxide,
## 0.999)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3466 -0.3021 -0.2588 0.6578 1.8335
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.3597848 0.0302518 210.229 < 2e-16 ***
## total.sulfur.dioxide -0.0021961 0.0006457 -3.401 0.000702 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4883 on 852 degrees of freedom
## Multiple R-squared: 0.0134, Adjusted R-squared: 0.01224
## F-statistic: 11.57 on 1 and 852 DF, p-value: 0.0007016
When I split the dataset into two groups, one with quality scores of 3-5 and the other with scores of 6-8, total.sulfur.dioxide explains only about 3% of the variance in quality for the 3-5 group and about 1% for the 6-8 group.
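The split itself can be sketched with `subset()`. Here the quality column is rebuilt from the per-score counts in the summary table, and the data frame `wine` is a hypothetical stand-in holding only that column.

```r
# Quality column rebuilt from the per-score counts in the summary table.
wine <- data.frame(quality = rep(3:8, times = c(10, 53, 681, 638, 199, 18)))

# Same two groups as above: scores 3-5 and scores 6-8.
wineQualityReds35 <- subset(wine, quality <= 5)
wineQualityReds68 <- subset(wine, quality >= 6)

c(scores_3_to_5 = nrow(wineQualityReds35),
  scores_6_to_8 = nrow(wineQualityReds68))
```

The groups contain 744 and 855 wines. This is consistent with the 741 and 852 residual degrees of freedom reported above, which correspond to 743 and 854 rows once the top 0.1% of total.sulfur.dioxide is removed.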
fixed.acidity
From the chart matrix, fixed.acidity seems to have a strong positive correlation with citric.acid and density.
##
## Call:
## lm(formula = fixed.acidity ~ citric.acid, data = subset(wineQualityReds,
## citric.acid <= quantile(wineQualityReds$citric.acid, 0.999)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7891 -0.8217 -0.0324 0.8059 5.9591
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.6870 0.0553 120.92 <2e-16 ***
## citric.acid 6.0283 0.1664 36.23 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.287 on 1595 degrees of freedom
## Multiple R-squared: 0.4515, Adjusted R-squared: 0.4511
## F-statistic: 1313 on 1 and 1595 DF, p-value: < 2.2e-16
With R^2 = 0.45, citric.acid explains about 45% of the variance in fixed.acidity.
volatile.acidity
From the chart matrix, volatile.acidity seems to have a strong negative correlation with citric.acid.
##
## Call:
## lm(formula = volatile.acidity ~ citric.acid, data = subset(wineQualityReds,
## citric.acid <= quantile(wineQualityReds$citric.acid, 0.999)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.36880 -0.09851 -0.01599 0.07528 0.91314
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.66686 0.00640 104.19 <2e-16 ***
## citric.acid -0.51458 0.01926 -26.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.149 on 1595 degrees of freedom
## Multiple R-squared: 0.3093, Adjusted R-squared: 0.3088
## F-statistic: 714.1 on 1 and 1595 DF, p-value: < 2.2e-16
Based on the R^2 value, citric.acid explains about 31% of the variance in volatile.acidity.
citric.acid
From the chart matrix, citric.acid seems to have a moderate positive correlation with chlorides, density and sulphates.
##
## Call:
## lm(formula = chlorides ~ citric.acid, data = subset(wineQualityReds,
## chlorides <= quantile(wineQualityReds$chlorides, 0.999)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.07520 -0.01789 -0.00667 0.00479 0.36411
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.076212 0.001832 41.603 < 2e-16 ***
## citric.acid 0.039227 0.005511 7.118 1.65e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04264 on 1595 degrees of freedom
## Multiple R-squared: 0.03079, Adjusted R-squared: 0.03018
## F-statistic: 50.67 on 1 and 1595 DF, p-value: 1.646e-12
##
## Call:
## lm(formula = density ~ citric.acid, data = subset(wineQualityReds,
## density <= quantile(wineQualityReds$density, 0.999)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0073433 -0.0009327 0.0000387 0.0010650 0.0059090
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.9957862 0.0000747 13330.2 <2e-16 ***
## citric.acid 0.0035142 0.0002239 15.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001743 on 1595 degrees of freedom
## Multiple R-squared: 0.1338, Adjusted R-squared: 0.1333
## F-statistic: 246.4 on 1 and 1595 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = sulphates ~ citric.acid, data = subset(wineQualityReds,
## sulphates <= quantile(wineQualityReds$sulphates, 0.999)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.30786 -0.09673 -0.02761 0.06082 1.29107
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.586730 0.006659 88.11 <2e-16 ***
## citric.acid 0.257854 0.020004 12.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1551 on 1595 degrees of freedom
## Multiple R-squared: 0.09434, Adjusted R-squared: 0.09377
## F-statistic: 166.1 on 1 and 1595 DF, p-value: < 2.2e-16
Based on the R^2 values, citric.acid explains less than 10% of the variance in chlorides (about 3%) and sulphates (about 9%), and about 13% of the variance in density.
chlorides
From the chart matrix, chlorides seems to have a moderate positive correlation with density and sulphates and a negative correlation with alcohol.
##
## Call:
## lm(formula = chlorides ~ density, data = subset(wineQualityReds,
## density <= quantile(wineQualityReds$density, 0.999)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.04630 -0.01651 -0.00836 0.00103 0.52340
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.6723 0.6134 -7.617 4.42e-14 ***
## density 4.7752 0.6154 7.759 1.51e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04603 on 1595 degrees of freedom
## Multiple R-squared: 0.03637, Adjusted R-squared: 0.03577
## F-statistic: 60.21 on 1 and 1595 DF, p-value: 1.514e-14
##
## Call:
## lm(formula = chlorides ~ sulphates, data = subset(wineQualityReds,
## sulphates <= quantile(wineQualityReds$sulphates, 0.999)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.12513 -0.01901 -0.00440 0.00882 0.46687
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.025119 0.004422 5.68 1.59e-08 ***
## sulphates 0.094452 0.006538 14.45 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04255 on 1595 degrees of freedom
## Multiple R-squared: 0.1157, Adjusted R-squared: 0.1152
## F-statistic: 208.7 on 1 and 1595 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = chlorides ~ alcohol, data = subset(wineQualityReds,
## alcohol <= quantile(wineQualityReds$alcohol, 0.999)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.06279 -0.01865 -0.00872 0.00327 0.51344
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.190591 0.011351 16.791 <2e-16 ***
## alcohol -0.009897 0.001084 -9.133 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04591 on 1596 degrees of freedom
## Multiple R-squared: 0.04966, Adjusted R-squared: 0.04907
## F-statistic: 83.41 on 1 and 1596 DF, p-value: < 2.2e-16
Based on the R^2 values, density and alcohol each explain less than 10% of the variance in chlorides (about 4% and 5%), whereas sulphates explains about 12%.
total.sulfur.dioxide
From the chart matrix, total.sulfur.dioxide seems to have a moderate negative correlation with alcohol.
##
## Call:
## lm(formula = total.sulfur.dioxide ~ alcohol, data = subset(wineQualityReds,
## alcohol <= quantile(wineQualityReds$alcohol, 0.999)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.008 -23.217 -8.064 13.936 254.729
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 113.9790 7.9572 14.32 <2e-16 ***
## alcohol -6.4804 0.7597 -8.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.18 on 1596 degrees of freedom
## Multiple R-squared: 0.0436, Adjusted R-squared: 0.043
## F-statistic: 72.76 on 1 and 1596 DF, p-value: < 2.2e-16
Based on the R^2 value, alcohol explains only about 4% of the variance in total.sulfur.dioxide.
density
From the chart matrix, density seems to have a strong negative correlation with alcohol and a moderate positive correlation with sulphates.
##
## Call:
## lm(formula = density ~ alcohol, data = subset(wineQualityReds,
## alcohol <= quantile(wineQualityReds$alcohol, 0.999)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0049662 -0.0010828 -0.0002425 0.0008610 0.0073845
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0060274 0.0004043 2488.42 <2e-16 ***
## alcohol -0.0008907 0.0000386 -23.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001635 on 1596 degrees of freedom
## Multiple R-squared: 0.2502, Adjusted R-squared: 0.2497
## F-statistic: 532.5 on 1 and 1596 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = density ~ sulphates, data = subset(wineQualityReds,
## sulphates <= quantile(wineQualityReds$sulphates, 0.999)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0065505 -0.0011168 0.0000013 0.0011520 0.0067538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.9956368 0.0001941 5130.025 < 2e-16 ***
## sulphates 0.0016875 0.0002869 5.881 4.96e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001868 on 1595 degrees of freedom
## Multiple R-squared: 0.02122, Adjusted R-squared: 0.02061
## F-statistic: 34.59 on 1 and 1595 DF, p-value: 4.956e-09
Based on the R^2 values, alcohol explains about 25% of the variance in density, but sulphates explains only about 2%.
To sum up this section, although volatile.acidity, citric.acid, total.sulfur.dioxide and alcohol appeared to be correlated with quality in the box plots, their R^2 values were not high. There might be too many outliers, or other factors affecting quality. Also, the minimum citric.acid value was largest within the subset of wines with quality score 8, which went against my expectation that citric.acid adds an unpleasant taste to wine. Finally, the R^2 between alcohol and density was unexpectedly high at about 25%.
From the box plots of quality against the other features:
From the matrix chart of selected attributes (fixed.acidity, volatile.acidity, citric.acid, chlorides, total.sulfur.dioxide, density, sulphates, alcohol, and total.acidity):
From the correlation table of all features:
On an R^2 basis, the relation between alcohol and density is unexpectedly strong at about 25%. After learning online that the density of alcohol is about 0.8 g/ml, this looks more natural.
Based on the R^2 values for explaining the variance in quality, alcohol explains the most, at about 23%. Although its R^2 is not as high as that of alcohol, volatile.acidity also explains about 15% of the variance in wine quality.
From the bivariate plots, I expect to find more relations between wine quality and the other attributes: volatile.acidity, citric.acid, total.sulfur.dioxide, density, sulphates, and alcohol. Also, since alcohol seems to be one of the important factors for good wine, I would like to explore the relations between alcohol and the attributes that seem correlated with it: chlorides, total.sulfur.dioxide and density.
Alcohol and chlorides, total.sulfur.dioxide and density
I created scatter plots of the combinations chlorides and density, total.sulfur.dioxide and density, and total.sulfur.dioxide and chlorides, applying a log transformation to the skewed variables.
From the plots, the chlorides and density combination seems to have the strongest relation with alcohol among the three. I wonder how it looks when it is split by quality with outliers removed.
After removing the density values above the third quartile and splitting by quality, it became clearer that the higher the density of a wine, the lower its alcohol percentage. Also, a wine with density below 0.9925 is likely to have quality score 6, and a wine with density above 1.0000 is likely to have quality score 5 or 6.
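Removing the values above the third quartile can be sketched as below. The data are synthetic and only mimic the scale of the density column; the names `toy`, `q3` and `kept` are hypothetical.

```r
set.seed(1)  # synthetic values for illustration only, NOT the wine data
toy <- data.frame(density = rnorm(200, mean = 0.9967, sd = 0.0019))

q3   <- quantile(toy$density, 0.75)   # third quartile of density
kept <- subset(toy, density <= q3)    # rows at or below the third quartile

nrow(kept)               # 150 of the 200 rows are kept
all(kept$density <= q3)  # TRUE: nothing above the cutoff remains
```

This filter discards the densest quarter of the wines, so any trend seen afterwards describes only the lower three quarters of the density range.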
Quality and volatile.acidity, citric.acid, total.sulfur.dioxide, density, sulphates, and alcohol
Observing the bivariate plots, I suspected that low-, middle- and high-quality wines follow different trends, so I plotted combinations of volatile.acidity, citric.acid, total.sulfur.dioxide, density, sulphates, and alcohol colored by quality.cut. From the univariate and bivariate analysis, I expected citric.acid, sulphates and alcohol to be positively correlated with quality, and volatile.acidity and density to be negatively correlated with it. Also, the distribution of total.sulfur.dioxide across quality scores appears close to a normal distribution. Pairing the positively and negatively correlated features with each other, I plotted the following scatter plots colored by quality.cut: “alcohol and citric.acid”, “sulphates and citric.acid”, “total.sulfur.dioxide and citric.acid”, “volatile.acidity and density” and “volatile.acidity and total.sulfur.dioxide”. For skewed data, a log transformation is applied so that the trend in the distribution becomes clearer. When plotting citric.acid, rows with citric.acid == 0 are removed to make the correlation easier to see.
Among the plots above, the plot of log10(sulphates) and citric.acid shows the clearest relation between quality and the two attributes. Good wines tend to have more citric.acid and sulphates, appearing in the upper right.
Since the R^2 values against quality were highest for alcohol (23%) and volatile.acidity (15%), I also drew a scatter plot of these two features.
Good wines are mainly grouped on the upper left, with a higher percentage of alcohol and lower volatile.acidity.
I would like to explore more about correlations between log10(sulphates) and citric.acid, and volatile.acidity and alcohol.
##
## Calls:
## lm: lm(formula = log10.sulphates ~ citric.acid, data = wineQualityReds)
##
## ================================
## (Intercept) -0.238***
## (0.004)
## citric.acid 0.165***
## (0.012)
## --------------------------------
## R-squared 0.110
## adj. R-squared 0.109
## sigma 0.092
## F 197.186
## p 0.000
## Log-likelihood 1553.693
## Deviance 13.409
## AIC -3101.387
## BIC -3085.255
## N 1599
## ================================
It is interesting that the correlation turns negative only for wines with quality score 8, where the variance of log10(sulphates) is also smaller.
##
## Calls:
## lm: lm(formula = volatile.acidity ~ alcohol, data = wineQualityReds)
##
## ================================
## (Intercept) 0.882***
## (0.043)
## alcohol -0.034***
## (0.004)
## --------------------------------
## R-squared 0.041
## adj. R-squared 0.040
## sigma 0.175
## F 68.138
## p 0.000
## Log-likelihood 515.359
## Deviance 49.139
## AIC -1024.718
## BIC -1008.587
## N 1599
## ================================
There is a strong correlation between alcohol and volatile.acidity for data points with quality == 3 or 8.
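Correlations within each quality score can be computed by splitting the data first. The sketch below uses synthetic data, not the wine dataset, so its numbers say nothing about the real wines; `toy` and `cor_by_quality` are hypothetical names.

```r
set.seed(7)  # synthetic data for illustration only, NOT the wine dataset
toy <- data.frame(
  quality = rep(c(3, 5, 8), each = 50),
  alcohol = runif(150, min = 8.4, max = 14.9)
)
toy$volatile.acidity <- 0.9 - 0.03 * toy$alcohol + rnorm(150, sd = 0.15)

# Pearson correlation between alcohol and volatile.acidity within each score.
cor_by_quality <- sapply(
  split(toy, toy$quality),
  function(d) cor(d$alcohol, d$volatile.acidity)
)
cor_by_quality

# Group sizes, to judge how much each correlation can be trusted.
table(toy$quality)
```

In the real data only 10 wines have quality 3 and 18 have quality 8, so correlations for those two groups rest on few points and should be read with caution.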
To summarize,
Volatile.acidity and alcohol seem to be the best combination for examining wine quality. In the plot of volatile.acidity and alcohol, good wines are mainly positioned on the upper left, with a higher percentage of alcohol and lower volatile.acidity. The plot of log10(sulphates) against citric.acid also shows a correlation between the two. Here good wines are in the upper right corner, with larger log10(sulphates) and more citric.acid.
On the plot of log10(chlorides) and density, after removing the density values above the third quartile and splitting by quality, it became clearer that the higher the density of a wine, the lower its alcohol percentage. Also, a wine with density below 0.9925 is likely to have quality score 6, 7 or 8, and a wine with density above 1.0000 is likely to have quality score 5 or 6. It is interesting that, although there is no strong correlation, wines with lower density tend to be good ones.
This plot indicates correlation between alcohol and quality. Also, when density is below 0.9925 or above 1.000, quality seems to fall above 5.
This plot shows the correlation of quality with citric.acid and log10(sulphates).
This plot shows the correlation of quality with alcohol and volatile.acidity and, at the same time, a stronger correlation between alcohol and volatile.acidity at scores 3 and 8.
This red wine dataset contains 1599 observations with 13 attributes. Not being familiar with wine tasting, I had to start by understanding what each attribute means and how it affects the taste of wine. In the end, there seemed to be no one-to-one relation between wine taste and any single attribute: as the attribute values vary, wine quality varies non-linearly. That is why I used a matrix chart to find the attributes most correlated with quality. From this chart, I could see that alcohol, sulphates, density, total.sulfur.dioxide, citric.acid and volatile.acidity had relatively strong relations with quality. From the online articles I had read, I misunderstood citric.acid to be a fault in wine taste, so it felt strange when I found a positive correlation between quality and citric.acid. I searched again and correctly understood that citric.acid becomes a flaw only when it is present in excess. For future work, since wine taste seems to result from several features and the balance among them, it would be possible to investigate further which composition and balance predict a good wine.
[1] Kaggle. Red Wine Dataset. Retrieved from https://www.kaggle.com/piyushgoyal443/red-wine-dataset#wineQualityInfo.txt
[2] P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236. Retrieved from http://dx.doi.org/10.1016/j.dss.2009.05.016
[3] ScienceDirect. Wine Quality. Retrieved from https://www.sciencedirect.com/topics/food-science/wine-quality
[4] Wikipedia. Acids in wine. Retrieved from https://en.wikipedia.org/wiki/Acids_in_wine
[5] Wikipedia. Wine fault. Retrieved from https://en.wikipedia.org/wiki/Wine_fault#Acetic_acid
[6] Wine in Moderation. How many grams of alcohol in wine? Retrieved from https://www.wineinmoderation.eu/en/articles/How-many-grams-of-alcohol-in-wine.154/